Lending Club data: exploratory data analysis, feature engineering and cleaning

Since the supplied CSV is huge, I am going to read it in chunks.
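A minimal sketch of chunked reading with pandas. The in-memory CSV stands in for the real file, and the chunk size is an arbitrary choice for illustration:

```python
import io

import pandas as pd

# Stand-in for the real Lending Club CSV (file name and chunk size are assumptions)
csv_text = "loan_amnt,int_rate\n1000,10.5\n2000,12.0\n3000,9.9\n4000,11.2\n"

# Read the CSV in fixed-size chunks and concatenate them into one DataFrame
chunks = pd.read_csv(io.StringIO(csv_text), chunksize=2)
df = pd.concat(chunks, ignore_index=True)
print(df.shape)  # (4, 2)
```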

Visualizing types of data

Now checking which columns have a high number of NaN

Removing columns where there is 100% NaN

member_id has been removed as it was 100% NaN
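A sketch of how such columns can be identified and dropped; the toy frame mirrors the `member_id` situation described above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "member_id": [np.nan, np.nan, np.nan],  # 100% NaN, as in the real data
    "loan_amnt": [1000, 2000, 3000],
})

nan_pct = df.isna().mean() * 100              # percentage of NaN per column
all_nan_cols = nan_pct[nan_pct == 100].index  # columns that are entirely NaN
df = df.drop(columns=all_nan_cols)
```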

Visualizing the percentage of people under different grades

The donut plot above shows that a significant portion of borrowers are placed in category B, followed closely by C. It might be the case that people placed in category B are preferred by the company? This question needs some investigation.
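The percentages behind the donut plot can be computed with `value_counts`; the grade values below are toy data, while the real notebook uses the dataset's `grade` column:

```python
import pandas as pd

# Toy grade column; the distribution shape is illustrative only
grades = pd.Series(["B", "B", "B", "C", "C", "A", "D"], name="grade")

# Share of borrowers in each grade, in percent
grade_pct = grades.value_counts(normalize=True) * 100
print(grade_pct.round(1))
```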

How many of them own their house, rent, or have a mortgage?

A significant number of borrowers are paying off mortgages. It appears that Lending Club feels more secure lending money to people who are paying their mortgages, followed closely by people living in rented dwellings.

What are the FICO scores of the borrowers?

Plotting the FICO score counts reveals that a significant number of people taking loans fall in the 660-664 range, with counts tapering off as the FICO score increases. This may be because people with high FICO scores may not need to borrow money at all.

Lending Club has a high number of customers in either the Fully Paid or Current status. This means the company's lending model is working well for them: the money going out is coming back, with some interest paid.

Grouping the data by employment length shows that a significant number of Lending Club customers have a long employment history, i.e. 10+ years.

Here I have grouped the data by grade and plotted it as a bubble chart. Mean loan amount and mean installment show a linear relationship, while the mean interest rate tends to grow sharply as we move from category B to G, with the exception of category A, which falls between categories B and C.

At a more granular level, how is the interest rate distributed within each category?

How are the loan amount, interest rate, etc. grouped by the purpose of the loan?

How are the medians of the loan amount, interest rate and annual income distributed across the states?

How are the defaulters and the people who repaid their loans distributed across the states?

CA leads in loan repayment, followed by TX and NY.

How is the interest rate distributed across the different categories of loan status?

Clearly, those who pay back fall in the range where interest is charged at a lower rate.

Up to now I have plotted averages of the various features in a variety of ways. To get a better picture of the statistical properties of the data, I will use violin plots of the interest rate by grade, state, etc.

Reverting back to seaborn because the size of the notebook is greater than 100 MB.

Not only does grade 'A' have the lowest median interest rate, its dispersion of interest rates is also the smallest. From a pure interest-rate point of view, it is clearly good to be in this category. The plot above further splits the data by term.

Loan amount and installment show a mostly linear, but heteroskedastic, relationship.

Is there a strong monotonic relationship between interest rate and loan amount? We can explore this using Spearman's rank correlation coefficient.

More can be learned about it from this great resource: https://statistics.laerd.com/statistical-guides/spearmans-rank-order-correlation-statistical-guide.php. It is defined as $r_s = 1 - \frac{6\sum d_{i}^2}{n(n^2-1)}$, where $d_i$ is the difference between the ranks of the $i$-th data point.

A very high Spearman rank correlation means that as the rank of one variable increases, the other increases monotonically. This is true for loan amount and installment, meaning these two variables are highly correlated, while the other parameters do not increase monotonically.
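A quick sketch with `scipy.stats.spearmanr` on synthetic data; the fixed 3% installment factor is an assumption, chosen only to make the monotonic case explicit:

```python
import numpy as np
from scipy.stats import spearmanr

loan_amnt = np.array([1000, 2500, 4000, 8000, 15000])
installment = loan_amnt * 0.03                      # perfectly monotonic in loan_amnt
int_rate = np.array([12.0, 9.5, 14.0, 8.0, 11.0])   # no monotonic pattern

rho_inst, _ = spearmanr(loan_amnt, installment)  # rank correlation = 1.0
rho_rate, _ = spearmanr(loan_amnt, int_rate)     # weak, here -0.3
```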

We can also examine the linear dependence between variables using the Pearson correlation.
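`DataFrame.corr` defaults to Pearson, so the linear check is one line (toy values again, roughly proportional by construction):

```python
import pandas as pd

df = pd.DataFrame({
    "loan_amnt": [1000.0, 2500.0, 4000.0, 8000.0],
    "installment": [31.0, 77.0, 123.0, 245.0],  # roughly linear in loan_amnt
})

# Pairwise Pearson correlation matrix
pearson = df.corr(method="pearson")
```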

This is a credit default problem and the loan status is a categorical variable, so I will encode it first. It has 10 possible values, so one-hot encoding would create many additional features. Instead, I will encode Charged Off as 1 and set everything else to 0.

Some more cleaning and then storing the dataframe in a cleaner format for further analysis

For this project I am trying to predict default. Therefore, I can encode all categories other than Charged Off as 0.
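A sketch of that binary encoding; `loan_status` is the dataset's column, while the `target` name is my choice:

```python
import pandas as pd

df = pd.DataFrame({
    "loan_status": ["Fully Paid", "Charged Off", "Current", "Late (31-120 days)"],
})

# 1 for Charged Off (default), 0 for every other status
df["target"] = (df["loan_status"] == "Charged Off").astype(int)
```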

Term information shouldn't affect the loan default.

Clearly I am not going to use this descriptive text for making a credit default prediction.

IDs are also not important.

The employment title is also going to be removed.

And the zip codes, too.

In the EDA we saw that the loan amount is highly correlated with the installment. This means it can be a redundant feature that may not add any information in a machine learning setup. I also need to get rid of the NaN values in the dataset.

NaN values have been removed from the object-type columns.

Clearly, some numerical features are more than 90% NaN. I am going to single them out and fill them with 0.

For columns with less than 90% NaN, I am going to replace the missing values with the column mean.
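The two fill rules together, sketched on toy numeric columns (the column names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, np.nan],  # > 90% NaN
    "some_missing": [1.0, np.nan, 3.0, np.nan],          # < 90% NaN
})

nan_frac = df.isna().mean()
heavy = nan_frac[nan_frac > 0.9].index    # columns that are mostly NaN
light = df.columns.difference(heavy)      # everything else

df[heavy] = df[heavy].fillna(0)                  # > 90% NaN -> fill with 0
df[light] = df[light].fillna(df[light].mean())   # otherwise -> column mean
```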

I will remove the target feature to check the correlation among the remaining variables.

Now I will try to get rid of highly correlated features, as seen with the Spearman correlation between the loan amount and the installment. This will ensure that redundant information does not go into the modeling part.
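One common recipe is to scan the upper triangle of the correlation matrix and drop one member of each highly correlated pair; the 0.95 threshold is my choice, not from the notebook:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "loan_amnt": [1000.0, 2000.0, 3000.0, 4000.0],
    "installment": [30.0, 60.0, 90.0, 120.0],  # = loan_amnt * 0.03, fully redundant
    "int_rate": [12.0, 9.0, 14.0, 8.0],
})

corr = df.corr().abs()
# Keep only the upper triangle so each pair is inspected once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
df = df.drop(columns=to_drop)
```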

I am going to drop this column too from the dataframe

A much nicer result: there are no NaN entries in the correlation matrix, and some of the highly correlated features have been removed. I will now focus on the highly correlated categorical features.

How much information each categorical feature codifies about the target?

Thus last_pymnt_d, last_credit_pull_d and next_pymnt_d give good information about the Charged Off status.

I will remove those features which have very little information about the target.

How correlated are the categorical features with each other? To compute the correlation between categorical variables I took help from this article, in which the author also links to his Kaggle kernel: https://towardsdatascience.com/the-search-for-categorical-correlation-a1cf7f1888c9. From the basic equation of Pearson's correlation coefficient it can be seen that it is not designed to handle categorical features, so some other measure is needed. One such approach is Cramér's V, which is a symmetrical measure ranging from 0 to 1.
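A minimal, uncorrected Cramér's V built on the chi-squared statistic (the linked article uses a bias-corrected variant; the toy grade/sub-grade pair is constructed so the association is perfect):

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def cramers_v(x, y):
    """Cramér's V between two categorical series (uncorrected version)."""
    confusion = pd.crosstab(x, y)
    chi2 = chi2_contingency(confusion)[0]
    n = confusion.to_numpy().sum()
    r, k = confusion.shape
    # Normalize phi^2 by its maximum possible value, then take the root
    return np.sqrt((chi2 / n) / min(r - 1, k - 1))


# Each sub-grade determines its grade, so V should be 1
grade = pd.Series(["A", "A", "B", "B", "C", "C"])
sub_grade = pd.Series(["A1", "A2", "B1", "B2", "C1", "C2"])
v = cramers_v(grade, sub_grade)
```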

Cramér's V shows that some features are highly correlated with each other (grade and sub_grade), while Theil's U tells us how much information a particular feature carries about the target variable. So first I will keep the object-type variables that carry enough information about the target, and then remove the object-type variables that are highly correlated among themselves.

I could have used plain mean encoding. However, I decided to apply Laplace smoothing to the local mean: if a certain level of a feature occurs very rarely, its local mean cannot be trusted, and more weight should be given to the global mean.
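A sketch of this smoothed encoding; the function name and the smoothing weight `m` are my own choices. A category seen only once is pulled strongly toward the global mean, exactly the behavior described above:

```python
import pandas as pd


def smoothed_mean_encode(cat, target, m=10):
    """Blend each category's local target mean with the global mean.

    Rare categories get little weight on their local mean (additive smoothing).
    """
    global_mean = target.mean()
    stats = target.groupby(cat).agg(["mean", "count"])
    smooth = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)
    return cat.map(smooth)


# "b" occurs once with target 1; its raw local mean (1.0) is untrustworthy
cat = pd.Series(["a"] * 9 + ["b"])
target = pd.Series([0] * 9 + [1])
encoded = smoothed_mean_encode(cat, target)
```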

Since this notebook has become very heavy, I will do all of this in another notebook, where I will also perform machine learning using PyCaret.